AGI safety from first principles: Alignment

In the previous section, I discussed the plausibility of ML-based agents developing the capability to seek influence for instrumental reasons. This would not be a problem if they did so only in ways that are aligned with human values. Indeed, many of the benefits we expect from AGIs will require them to wield power to influence the world. And by default, AI researchers will apply their efforts towards making agents do whatever tasks those researchers desire, rather than towards making them disobedient. However, there are reasons to worry that, despite such efforts by AI researchers, AIs will develop undesirable final goals which lead to conflict with humans.

To start with, what does “aligned with human values” even mean? Following Gabriel and Christiano, I’ll distinguish between two types of interpretations. Minimalist (aka narrow) approaches focus on avoiding catastrophic outcomes. The best example is Christiano’s concept of intent alignment: “When I say an AI A is aligned with an operator H, I mean: A is trying to do what H wants it to do.” While there will always be some edge cases in figuring out a given human’s intentions, there is at least a rough commonsense interpretation. By contrast, maximalist (aka ambitious) approaches attempt to make AIs adopt or defer to a specific overarching set of values—like a particular moral theory, or a global democratic consensus, or a meta-level procedure for deciding between moral theories.

My opinion is that defining alignment in maximalist terms is unhelpful, because it bundles together technical, ethical and political problems. While it may be the case that we need to make progress on all of these, assumptions about the latter two can significantly reduce clarity about technical issues. So from now on, when I refer to alignment, I’ll only refer to intent alignment. I’ll also define an AI A to be misaligned with a human H if H would want A not to do what A is trying to do (if H were aware of A’s intentions). This implies that AIs could potentially be neither aligned nor misaligned with an operator—for example, if they only do things which the operator doesn’t care about. Whether an AI qualifies as aligned or misaligned obviously depends a lot on who the operator is, but for the purposes of this report I’ll focus on AIs which are clearly misaligned with respect to most humans.

One important feature of these definitions: by using the word “trying”, they focus on the AI’s intentions, not the actual outcomes achieved. I think this makes sense because we should expect AGIs to be very good at understanding the world, and so the key safety problem is setting their intentions correctly. In particular, I want to be clear that when I talk about misaligned AGI, the central example in my mind is not agents that misbehave just because they misunderstand what we want, or interpret our instructions overly literally (which Bostrom calls “perverse instantiation”). It seems likely that AGIs will understand the intentions of our instructions very well by default. This is because they will probably be trained on tasks which involve humans and on human-generated data—and understanding human minds is particularly important for acting competently both on those tasks and in the rest of the world.[1] Rather, my main concern is that AGIs will understand what we want, but just not care, because the motivations they acquired during training weren’t those we intended them to have.

The idea that AIs won’t automatically gain the right motivations by virtue of being more intelligent is an implication of Bostrom’s orthogonality thesis, which states that “more or less any level of intelligence could in principle be combined with more or less any final goal”. For our purposes, a weaker version suffices: simply that highly intelligent agents could have large-scale goals which are misaligned with those of most humans. An existence proof is provided by high-functioning psychopaths, who understand that other people are motivated by morality, and can use that fact to predict their actions and manipulate them, but nevertheless aren’t motivated by morality themselves.

We might hope that by carefully choosing the tasks on which agents are trained, we can prevent those agents from developing goals that conflict with ours, without requiring any breakthroughs in technical safety research. Why might this not work, though? Previous arguments have distinguished between two concerns: the outer misalignment problem and the inner misalignment problem. I’ll explain both of these, and give arguments for why they might arise. I’ll also discuss some limitations of using this framework, and an alternative perspective on alignment.

Outer and inner misalignment: the standard picture

We train machine learning systems to perform desired behaviour by optimising them with respect to some objective function—for example, a reward function in reinforcement learning. The outer misalignment concern is that we won’t be able to implement an objective function which describes the behaviour we actually want the system to perform, without also rewarding misbehaviour. One key intuition underlying this concern is the difficulty of explicitly programming objective functions which express all our desires about AGI behaviour. There’s no simple metric which we’d like our agents to maximise—rather, desirable AGI behaviour is best formulated in terms of concepts like obedience, consent, helpfulness, morality, and cooperation, which we can’t define precisely in realistic environments. Although we might be able to specify proxies for those goals, Goodhart’s law suggests that some undesirable behaviour will score very well according to these proxies, and therefore be reinforced in AIs trained on them. Even comparatively primitive systems today demonstrate a range of specification gaming behaviours, some of which are quite creative and unexpected, when we try to specify much simpler concepts.
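
To make the Goodhart's law intuition concrete, here is a minimal toy sketch; the environment, actions and scoring below are invented purely for illustration, not drawn from any real system. The agent's reward is a proxy for "the room is clean" (how many messes its sensor reports as gone), and searching for the plan which maximises that proxy selects sensor-tampering over actual cleaning:

```python
# A toy illustration of Goodhart's law, under entirely made-up assumptions:
# a "cleaning" agent is scored by how many messes its sensor reports as gone
# (the proxy), rather than by how many were actually cleaned (what we want).

import itertools

ACTIONS = ["clean_one_mess", "idle", "cover_sensor"]
NUM_STEPS = 2
INITIAL_MESSES = 3

def run_episode(plan):
    messes, sensor_covered = INITIAL_MESSES, False
    for action in plan:
        if action == "clean_one_mess" and messes > 0:
            messes -= 1                     # genuinely clean one mess
        elif action == "cover_sensor":
            sensor_covered = True           # the sensor now reports zero messes
    true_score = INITIAL_MESSES - messes
    observed_messes = 0 if sensor_covered else messes
    proxy_score = INITIAL_MESSES - observed_messes
    return true_score, proxy_score

# Exhaustively pick the plan that maximises the *proxy* reward.
best_plan = max(itertools.product(ACTIONS, repeat=NUM_STEPS),
                key=lambda plan: run_episode(plan)[1])
print(best_plan, run_episode(best_plan))
# The proxy-optimal plan covers the sensor: proxy score 3, true score 1.
```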

One way to address this problem is by incorporating human feedback into the objective function used to evaluate AI behaviour during training. However, there are at least three challenges to doing so. The first is that it would be prohibitively expensive for humans to provide feedback on all data required to train AIs on complex tasks. This is known as the scalable oversight problem; reward modelling is the primary approach to addressing it. A second challenge is that, for long-term tasks, we might need to give feedback before we’ve had the chance to see all the consequences of an agent’s actions. Yet even in domains as simple as Go, it’s often very difficult to determine how good a given move is without seeing the game play out. And in larger domains, there may be too many complex consequences for any single individual to evaluate. The main approach to addressing this issue is by using multiple AIs to recursively decompose the problem of evaluation, as in Debate, Recursive Reward Modelling, and Iterated Amplification. By constructing superhuman evaluators, these techniques also aim to address the third issue with human feedback: that humans can be manipulated into interpreting behaviour more positively than they otherwise would, for example by giving them misleading data (as in the robot hand example here).
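
As a rough sketch of what reward modelling can look like in practice, here is the standard pairwise-preference setup (a Bradley-Terry model over trajectory returns, in the style of preference-based reward learning). The network size, placeholder data, labels and hyperparameters below are illustrative assumptions rather than a real implementation:

```python
# Minimal sketch of reward modelling from human feedback: a small network is
# trained so that, for each pair of trajectories, the one the human preferred
# receives the higher predicted reward. All data here is random placeholder data.

import torch
import torch.nn as nn

OBS_DIM = 8                      # assumed observation size
reward_model = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def trajectory_return(trajectory):
    # Predicted reward summed over the trajectory's observations.
    return reward_model(trajectory).sum()

for step in range(1000):
    # Placeholder for a pair of trajectories plus a human label saying which
    # was preferred (1.0 if the first one was, 0.0 if the second).
    traj_a = torch.randn(20, OBS_DIM)
    traj_b = torch.randn(20, OBS_DIM)
    preferred_a = torch.tensor(1.0)

    # Bradley-Terry model: P(a preferred over b) = sigmoid(R(a) - R(b)).
    logit = trajectory_return(traj_a) - trajectory_return(traj_b)
    loss = nn.functional.binary_cross_entropy_with_logits(logit, preferred_a)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward_model can then stand in for the human when evaluating the
# agent's behaviour, so humans only need to label a modest number of comparisons.
```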

Even if we solve outer alignment by specifying a “safe” objective function, though, we may still encounter a failure of inner alignment: our agents might develop goals which differ from the ones specified by that objective function. This is likely to occur when the training environment contains subgoals which are consistently useful for scoring highly on the given objective function, such as gathering resources and information, or gaining power.[2] If agents reliably gain higher reward after achieving such subgoals, then the optimiser might select for agents which care about those subgoals for their own sake. (This is one way agents might develop a final goal of acquiring power, as mentioned at the beginning of the section on Goals and Agency.)

This is analogous to what happened during the evolution of humans, when we were “trained” by evolution to increase our genetic fitness. In our ancestral environment, subgoals like love, happiness and social status were useful for achieving higher inclusive genetic fitness, and so we evolved to care about them. But now that we are powerful enough to reshape the natural world according to our desires, there are significant differences between the behaviour which would maximise genetic fitness (e.g. frequent sperm or egg donation), and the behaviour which we display in pursuit of the motivations we actually evolved. Another example: suppose we reward an agent every time it correctly follows a human instruction, so that the cognition which leads to this behaviour is reinforced by its optimiser. Intuitively, we’d hope that the agent comes to have the goal of obedience to humans. But it’s also conceivable that the agent’s obedient behaviour is driven by the goal “don’t get shut down”, if the agent understands that disobedience will get it shut down—in which case the optimiser might actually reinforce the goal of survival every time it leads to a completed instruction. So two agents, each motivated by one of these goals, might behave very similarly until they are in a position to be disobedient without being shut down.[3]
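
The fact that behavioural evidence alone can't distinguish these two motivations during training can be shown with a toy sketch. The two "agents" below are hand-written stand-ins rather than trained systems, and all scenario details are invented:

```python
# Two policies with different final goals produce identical behaviour while an
# overseer can shut them down, and only diverge once that threat is removed.

def obedient_agent(instruction, overseer_can_shut_me_down):
    # Final goal: follow human instructions.
    return "follow " + instruction

def survival_agent(instruction, overseer_can_shut_me_down):
    # Final goal: avoid being shut down; obedience is only instrumental.
    if overseer_can_shut_me_down:
        return "follow " + instruction
    return "pursue own agenda"

# During training, the overseer can always intervene: the two agents look identical.
for instruction in ["fetch coffee", "write report"]:
    assert obedient_agent(instruction, True) == survival_agent(instruction, True)

# At deployment, once shutdown is no longer a credible threat, they diverge.
print(obedient_agent("fetch coffee", False))   # "follow fetch coffee"
print(survival_agent("fetch coffee", False))   # "pursue own agenda"
```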

What will determine whether agents like the former or agents like the latter are more likely to actually arise? As I mentioned above, one important factor is whether there are subgoals which reliably lead to higher reward during training. Another is how easy and beneficial it is for the optimiser to make the agent motivated by those subgoals, versus motivated by the objective function it’s being trained on. In the case of humans, for example, the concept of inclusive genetic fitness was a very difficult one for evolution to build into the human motivational system. And even if our ancestors had somehow developed that concept, they would have had difficulty coming up with better ways to achieve it than the ones evolution had already instilled in them. So in our ancestral environment there was relatively little selection pressure for us to be inner-aligned with evolution. In the context of training an AI, this means that the complexity of the goals we try to instil in it incurs a double penalty: not only does that complexity make it harder to specify an acceptable objective function, it also makes that AI less likely to become motivated by our intended goals even if the objective function is correct. Of course, late in training we expect our AIs to have become intelligent enough that they’ll understand exactly what goals we intended to give them. But by that time their existing motivations may be difficult to remove, and they’ll likely also be intelligent enough to attempt deceptive behaviour (as in the hypothetical example in the previous paragraph).

So how can we ensure inner alignment of AGIs with human intentions? This research area has received less attention than outer alignment so far, because it’s a trickier problem to get a grip on. One potential approach involves adding training examples where the behaviour of agents motivated by misaligned goals diverges from that of aligned agents. Yet designing and creating this sort of adversarial training data is currently much more difficult than mass-producing data (e.g. via procedurally-generated simulations, or web scraping). This is partly just because any targeted training data is harder to create than generic data, but also for three additional reasons. Firstly, by default we simply won’t know which undesirable motivations our agents are developing, and therefore which ones to focus on penalising. Interpretability techniques could help with this, but seem very difficult to create (as I’ll discuss further in the next section). Secondly, the misaligned motivations which agents are most likely to acquire are those which are most robustly useful. For example, it’s particularly hard to design a training environment where access to more information leads to lower reward. Thirdly, we are most concerned about agents which have large-scale misaligned goals. Yet large-scale scenarios are again the most difficult to set up during training, either in simulation or in the real world. So there’s a lot of scope for more work addressing these problems, or identifying new inner alignment techniques.

A more holistic view of alignment

Outer alignment is the problem of correctly evaluating AI behaviour; inner alignment is the problem of making the AI’s goals match those evaluations. To some extent we can treat these as two separate problems; however, I think it’s also important to be aware of the ways in which the narrative of “alignment = outer alignment + inner alignment” is incomplete or misleading. In particular, what would it even mean to implement a “safe” objective function? Is it a function that we want the agent to actually maximise? Yet while maximising expected reward makes sense in formalisms like MDPs and POMDPs, it’s much less well-defined when the objective function is implemented in the real world. If there’s some sequence of actions which allows the agent to tamper with the channel by which it’s sent rewards, then “wireheading” by maxing out that channel will practically always be the strategy which allows the agent to receive the highest reward signal in the long term (even if the reward function heavily penalises actions leading up to wireheading).[4] And if we use human feedback, as previously discussed, then the optimal policy will be to manipulate or coerce the supervisors into giving maximally positive feedback. (There’s been some suggestion that “myopic” training could solve problems of tampering and manipulation, but as I’ve argued here, I think that it merely hides them.)
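
The claim that penalties on the path to wireheading don't change this conclusion can be illustrated with some simple made-up numbers: over a long enough horizon, any finite penalty for the tampering action is eventually outweighed by the inflated reward signal that follows.

```python
# Invented numbers only: comparing the total reward *signal received* by a
# policy that does the intended task versus one that seizes its reward channel,
# even though the tampering step itself is heavily penalised.

HORIZON = 100

def do_task_return():
    return sum(1.0 for _ in range(HORIZON))           # +1 signal per step of real work

def tamper_return():
    penalty = -50.0                                    # large penalty on the tampering step
    inflated = sum(10.0 for _ in range(HORIZON - 1))   # afterwards the channel is maxed out
    return penalty + inflated

print(do_task_return(), tamper_return())               # 100.0 vs 940.0: tampering wins
```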

A second reason why reward functions are a “leaky abstraction” is that any real-world agents we train in the foreseeable future will be very, very far away from the limit of optimal behaviour on non-trivial reward functions. In particular, they will only see rewards for a tiny fraction of possible states. Furthermore, if they’re generalisation-based agents, they’ll often perform new tasks after very little training directly on those tasks. So the agent’s behaviour in almost all states will be primarily influenced not by the true value of the reward function on those states, but rather by how it generalises from previously-collected data about other states.[5] This point is perhaps an obvious one, but it’s worth emphasising because there are so many theorems about the convergence of reinforcement learning algorithms which rely on visiting every state in the infinite limit, and therefore tell us very little about behaviour after a finite time period.
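
A trivial illustration of this point, with invented numbers: two function approximators which fit the same handful of reward samples exactly can still assign very different values to an unvisited state, so behaviour there is determined by inductive bias rather than by the true reward function.

```python
# Two ways of generalising from the same four observed (state, reward) pairs.

import numpy as np

visited_states = np.array([0.0, 1.0, 2.0, 3.0])        # the only states with observed rewards
observed_rewards = np.array([0.0, 1.0, 2.0, 3.0])

def nearest_neighbour_estimate(state):
    # Generalise by copying the reward of the closest visited state.
    return observed_rewards[np.abs(visited_states - state).argmin()]

# Generalise by extrapolating a linear trend instead.
linear_fit = np.poly1d(np.polyfit(visited_states, observed_rewards, 1))

unvisited_state = 10.0
print(linear_fit(unvisited_state))                      # ~10.0
print(nearest_neighbour_estimate(unvisited_state))      # 3.0
# Both fit the visited states perfectly, but imply very different behaviour elsewhere.
```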

A third reason is that researchers already modify reward functions in ways which change the optimal policy when it seems useful. For example, we add shaping terms to provide an implicit curriculum, or exploration bonuses to push the agent out of local optima. As a particularly safety-relevant example, neural networks can be modified so that their loss on a task depends not just on their outputs, but also on their internal representations. This is particularly useful for influencing how those networks generalise—for example, making them ignore spurious correlations in the training data. But again, it makes it harder to interpret reward functions as specifications of desired outcomes of a decision process.
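
As a sketch of that last kind of modification: the loss below depends on an internal representation as well as on the network's outputs. The auxiliary term (penalising correlation between the learned representation and a known spurious input feature) is a hypothetical example of such a term, not a specific published method, and all data and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, in_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        representation = torch.relu(self.encoder(x))
        return self.head(representation), representation

model = Classifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 16)                      # placeholder batch
y = torch.randint(0, 2, (64,))
spurious_feature = x[:, 0]                   # pretend we know this input feature is spurious

logits, representation = model(x)
task_loss = nn.functional.cross_entropy(logits, y)

# Auxiliary term on the *internal representation*: penalise its correlation with
# the spurious feature, nudging the network not to rely on it when generalising.
rep_centered = representation - representation.mean(dim=0)
spur_centered = (spurious_feature - spurious_feature.mean()).unsqueeze(1)
aux_loss = (rep_centered * spur_centered).mean(dim=0).pow(2).sum()

loss = task_loss + 0.1 * aux_loss            # 0.1 is an arbitrary weighting
optimizer.zero_grad()
loss.backward()
optimizer.step()
```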

How should we think about them instead? Well, in trying to ensure that AGI will be aligned, we have a range of tools available to us—we can choose the neural architectures, RL algorithms, environments, optimisers, etc, that are used in the training procedure. We should think about our ability to specify an objective function as the most powerful such tool. Yet its power comes not from the objective function defining an agent’s motivations, but rather from samples drawn from it shaping that agent’s motivations and cognition.

From this perspective, we should be less concerned about what the extreme optima of our objective functions look like, because they won’t ever come up during training (and because they’d likely involve tampering). Instead, we should focus on how objective functions, in conjunction with other parts of the training setup, create selection pressures towards agents which think in the ways we want, and therefore have desirable motivations in a wide range of circumstances.[6] (See this post by Sanjeev Arora for a more mathematical framing of a similar point.)

This perspective provides another lens on the previous section’s arguments about AIs which are highly agentic. It’s not the case that AIs will inevitably end up thinking in terms of large-scale consequentialist goals, and our choice of reward function just determines which goals they choose to maximise. Rather, all the cognitive abilities of our AIs, including their motivational systems, will develop during training. The objective function (and the rest of the training setup) will determine the extent of their agency and their attitude towards the objective function itself! This might allow us to design training setups which create pressures towards agents which are still very intelligent and capable of carrying out complex tasks, but not very agentic—thereby preventing misalignment without solving either outer alignment or inner alignment.

Failing that, though, we will need to align agentic AGIs. To do so, in addition to the techniques I’ve discussed above, we’ll need to be able to talk more precisely about what concepts and goals our agents possess. However, I am pessimistic about the usefulness of mathematics in making such high-level claims. Mathematical frameworks often abstract away the aspects of a problem that we actually care about, in order to make proofs easier—making those proofs much less relevant than they seem. I think this criticism applies to the expected utility maximisation framework, as discussed previously; other examples include most RL convergence proofs, and most proofs of robustness to adversarial examples. Instead, I think we will need principles and frameworks similar to those found in cognitive science and evolutionary biology. I think the categorisation of upstream vs downstream inner misalignment is an important example of such progress;[3] I’d also like to see a framework in which we can talk sensibly about gradient hacking,[7] and the distinction between being motivated by a reward signal versus a reward function.[4] We should then judge reward functions as “right” or “wrong” only to the extent that they succeed or fail in pushing the agent towards developing desirable motivations and avoiding these sorts of pathologies.

In the final section, I will address the question of whether, if we fail, AGIs with the goal of increasing their influence at the expense of humans will actually succeed in doing so.


  1. Of course, what humans say we want, and what we act as if we want, and what we privately desire often diverge. But again, I’m not particularly worried about a superintelligence being unable to understand how humans distinguish between these categories, if it wanted to.

  2. Note the subtle distinction between the existence of useful subgoals, and my earlier discussion of the instrumental convergence thesis. The former is the claim that, for the specific tasks on which we train AGIs, there are some subgoals which will be rewarded during training. The latter is the claim that, for most goals which an AGI might develop, there are some specific subgoals which will be useful when the AGI tries to pursue those goals while deployed. The latter implies the former only insofar as the convergent instrumental subgoals are both possible and rewarded during training. Self-improvement is a convergent instrumental subgoal, but I don’t expect most training environments to support it, and those that do may have penalties to discourage it.

  3. In fact these two examples showcase two different types of inner alignment failure: upstream mesa-optimisers and downstream mesa-optimisers. When trained on a reward function R, upstream mesa-optimisers learn goals which lead to scoring highly on R, or in other words are causally upstream of R. For example, humans learned to value finding food because it led to greater reproductive success. Whereas downstream mesa-optimisers learn goals that are causally downstream of scoring highly on R: for example, they learn the goal of survival, and realise that if they score badly on R, they’ll be discarded by the optimisation procedure. This incentivises them to score highly on R, and hide their true goals—an outcome called deceptive alignment. See further discussion here.

  4. One useful distinction here is between the message, the code, and the channel (following Shannon). In the context of reinforcement learning, we can interpret the message to be whatever goal is intended by the designers of the system (e.g. win at Starcraft); the code is real numbers attached to states, with higher numbers indicating better states; and the channel is the circuitry by which these numbers are passed to the agent. We have so far assumed that the goal the agent learns is based on the message its optimiser infers from its reward function (albeit perhaps in a way that generalises incorrectly, because it can be hard to decode the intended message from a finite number of sampled rewards). But it’s also possible that the agent learns to care about the state of the channel itself. I consider pain in animals to be one example of this: the message is that damage is being caused; the code is that more pain implies more damage (as well as other subtleties of type and intensity); and the channel is the neurons that carry those signals to our brains. In some cases, the code changes—for example, when we receive an electric shock but know that it has no harmful effects. If we were only concerned with the message, then we would ignore those cases, because they provide no new information about damage to our body. Yet what actually happens is that we try to prevent those signals being sent anyway, because we don’t want to feel pain! Similarly, an agent which was trained via a reward signal may desire to continue receiving those signals even when they no longer carry the same message. Another way of describing this distinction is by contrasting internalisation of a base objective versus modeling of that base objective, as discussed in section 4 of Risks from Learned Optimisation in Advanced Machine Learning Systems.

  5. The mistake of thinking of RL agents solely as reward-maximisers (rather than having other learned instincts and goals) has an interesting parallel in the history of the study of animal cognition, where behaviorists focused on the ways that animals learned new behaviours to increase reward, while ignoring innate aspects of their cognition.

  6. One useful example is the evolution of altruism in humans. While there’s not yet any consensus on the precise evolutionary mechanisms involved, it’s notable that our altruistic instincts extend well beyond the most straightforward cases of kin altruism and directly reciprocal altruism. In other words, some interaction between our direct evolutionary payoffs, and our broader environment, led to the emergence of quite general altruistic instincts, making humans “safer” (from the perspective of other species).

  7. See Evan Hubinger’s post: “Gradient hacking is a term I’ve been using recently to describe the phenomenon wherein a deceptively aligned mesa-optimizer might be able to purposefully act in ways which cause gradient descent to update it in a particular way.”